Abstract. MKM has been defined as the quest for technologies to manage mathematical
knowledge. MKM "in the small" is well-studied, so the real problem is to scale up to
large, highly interconnected corpora: "MKM in the large". We contend that advances in
two areas are needed to reach this goal. We need representation languages that support
incremental processing of all primitive MKM operations, and we need software
architectures and implementations that implement these operations scalably on large
knowledge bases. We present instances of both in this paper: the Mmt framework for
modular theory-graphs that integrates meta-logical foundations, which forms the base of
the next OMDoc version; and TNTBase, a versioned storage system for XML-based document
formats. TNTBase becomes an MMT database by instantiating
it with special MKM operations for MMT.

1 Introduction

[12] defines the objective of MKM to be to develop new and better ways of managing
mathematical knowledge using sophisticated software tools and later states the "Grand
Challenge of MKM" as a universal digital mathematics library (UDML), which is indeed
a grand challenge, as it envisions that the UDML would continuously grow and in time
would contain essentially all mathematical knowledge, which is estimated to be well in
excess of 10^ published pages.¹ All current efforts towards comprehensive
machine-organizable libraries of mathematics are at least three orders of magnitude
smaller than the UDML envisioned by Farmer in 2004: Formal libraries like those of
Mizar ([33]), Isabelle ([26]), or PVS ([25]) have ca. 10^ statements (definitions and
theorems). Even the semi-formal, commercial Wolfram MathWorld, which bills itself as
the world's most extensive mathematics resource, only has 10^ entries. There is
anecdotal evidence that already at this size, management procedures are struggling.

To meet the MKM Grand Challenge, we will have to develop fundamentally more scalable
ways of dealing with mathematical knowledge, especially since [12] goes on to postulate
that the UDML would also be continuously reorganized and consolidated as new
connections and discoveries were made. Clearly this can only be achieved
algorithmically; experience with the libraries cited above already shows that manual
MKM does not scale sufficiently. Most of the work in the MKM community has concentrated
on what we could call "MKM in the small", i.e., dealing with aspects of MKM that do
not explicitly address issues of very large knowledge collections; these we call "MKM
in the large".

* The final publication of this paper is available at www.springerlink.com.
¹ For instance, Zentralblatt Math contains 2.4 million abstracts of articles from mathematical

In this paper we contribute to the MKM Grand Challenge of doing formal "MKM
in the large" by analyzing scalability challenges inherent in MKM and proposing steps
towards solutions based on our Mmt format, which is the basis for the next version of
OMDoc. We justify our conclusions and recommendations for scalability techniques
with concrete case studies we have undertaken in recent years. Section 2 tackles
scalability issues pertaining to the representation languages used in the formalization
of mathematical knowledge. Section 3 discusses how the modularity features of Mmt can
be realized scalably by implementing basic MKM functionality like validation, querying,
and presentation incrementally and carefully evaluating the on-the-fly computation (and
caching) of induced representations. These considerations, which are mainly concerned
with efficient computation "in memory", are complemented with a discussion of mass
storage, caching, and indexing in Section 4, which addresses scalability issues in large
collections of mathematical knowledge. Section 5 concludes the paper and addresses
avenues of further research.



2 A Scalable Representation Language 

Our representation language Mmt was introduced in [29]. It arises from three cen- 
tral design goals. Firstly, it should provide an expressive but simple module system as 
modularity is a necessary requirement for scalability. As usual in language design, the 
goals of simplicity and expressivity form a difficult trade-off that must be solved by 
identifying the right primitive module constructs. Secondly, scalability across semantic 
domains requires foundation-independence in the sense that Mmt does not commit 
to any particular foundation (such as Zermelo-Fraenkel set theory or Church's higher- 
order logic). Providing a good trade-off between this level of generality and the ability 
to give a rigorous semantics is a unique feature of Mmt. Finally, scalability across 
implementation domains requires standards-compliance, and while using XML and 
OpenMath is essentially orthogonal to the language design, the use of URIs as iden- 
tifiers is not as it imposes subtle constraints that can be very hard to meet a posteriori. 

Mmt represents logical knowledge on three levels: On the module level, Mmt
builds on modular algebraic specification languages for logical knowledge such as OBJ
[14], ASL [32], development graphs [1], and CASL [7]. In particular, Mmt uses
theories and theory morphisms as the primitive modular concepts. In contrast to these
languages, Mmt only imposes very lightweight assumptions on the underlying language.
This leads to a very simple generic module system that subsumes almost all aspects of
the syntax and semantics of specific module systems such as PVS [25], Isabelle [26],
or Coq [3].

On the symbol level, Mmt is a simple declarative language that uses named symbol
declarations where symbols may or may not have a type or a definiens. By experimental
evidence, this subsumes virtually all declarative languages. In particular, Mmt
uses the Curry-Howard correspondence [8,17] to represent axioms and theorems as
constants, and proofs as terms. Sets of symbol declarations yield theories and
correspond to OpenMath content dictionaries.

On the object level, Mmt uses the formal grammar of OpenMath [6] to rep- 
resent mathematical objects without committing to a specific formal foundation. The 
semantics of objects is given proof theoretically using judgments for typing and equal- 
ity between objects. Mmt is parametric in these judgments, and the precise choice is 
relegated to a foundation. 

2.1 Module System 

Sophisticated mathematical reasoning usually involves several related but different math- 
ematical contexts, and it is desirable to exploit these relationships by moving theorems 
between contexts. It is well-known that modular design can reduce space to an extent 
that is exponential in the depth of the reuse relation between the modules, and this ap- 
plies in particular to the large theory hierarchies employed in mathematics and computer 
science. 

The first applications of this technique in mathematics are found in the works by 
Bourbaki ([4,5]), which tried to prove every theorem in the context with the smallest 
possible set of axioms. Mmt follows the "little theories approach" proposed in [11], in 
which separate contexts are represented by separate theories, and structural relation- 
ships between contexts are represented as theory morphisms, which serve as conduits 
for passing information (e.g., definitions and theorems) between theories (see [10]). 
This yields theory graphs where the nodes are theories and the paths are theory mor- 
phisms. 

Example 1 (Algebra). For example, consider the theory graph in Fig. 1 for a portion 
of algebra, which was formalized in Mmt in [9]. It defines the theory of magmas (A 
magma has a binary operation without axioms.) and extends it successively to monoids, 
groups, and commutative groups. Then the theory of rings is formed by importing from 
both CGroup (for the additive operation) and Monoid (for the multiplicative operation). 

A crucial property here is that the imports are named, e.g., Monoid imports from
Magma via an import named mag. While redundant in some cases, it is essential in Ring
where we have to distinguish two theory morphisms from Monoid to Ring: The
composition add/grp/mon for the additive monoid and mult for the multiplicative monoid.

The import names are used to form qualified names for the imported symbols. For
example, if * is the name of the binary operation in Magma, then add/grp/mon/mag/*
denotes addition and mult/mag/* multiplication in Ring. Of course, Mmt supports
the use of abbreviations instead of qualified names, but it is a crucial prerequisite for
scalability to make qualified names the default: Without named imports, every time we
add a new name in Magma (e.g., for an abbreviation or a theorem), we would have to add
corresponding renamings in Ring to avoid name clashes.
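
The way named imports yield qualified names can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionary layout and the function name flatten are hypothetical and not part of the Mmt API, axioms and the Distrib theory are omitted, and the symbol names e and inv follow the running example.

```python
# Hypothetical in-memory theory graph (not the Mmt API).
# Each theory lists its own symbols and its named imports
# (import name -> imported theory).
THEORIES = {
    "Magma":  {"symbols": ["*"],   "imports": {}},
    "Monoid": {"symbols": ["e"],   "imports": {"mag": "Magma"}},
    "Group":  {"symbols": ["inv"], "imports": {"mon": "Monoid"}},
    "CGroup": {"symbols": [],      "imports": {"grp": "Group"}},
    "Ring":   {"symbols": [],      "imports": {"add": "CGroup", "mult": "Monoid"}},
}


def flatten(theory, graph=THEORIES):
    """Return the qualified names of all symbols available in `theory`."""
    decl = graph[theory]
    names = list(decl["symbols"])
    for import_name, source in decl["imports"].items():
        # Every imported symbol is prefixed with the name of the import,
        # so the two copies of Magma in Ring can never clash.
        names.extend(f"{import_name}/{n}" for n in flatten(source, graph))
    return names


ring = flatten("Ring")
```

Under these assumptions, ring contains both add/grp/mon/mag/* and mult/mag/*, and adding a further symbol to Magma only adds two new qualified names in Ring without requiring any renaming.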

Another reason to use named imports is that we can use them to instantiate imports
with theory morphisms. In our example, distributivity is stated separately as a theory
that imports two magmas. Let us assume the left distributivity axiom is stated as

∀x, y, z. x mag1/* (y mag2/* z) = (x mag1/* y) mag2/* (x mag1/* z)

Then the import dist from Distrib to Ring will carry the instantiations mag1 ↦
mult/mag and mag2 ↦ add/grp/mon/mag.

In other module systems such as SML, such instantiations are called (asymmetric)
sharing declarations. In terms of theory morphisms, their semantics is a commutative
diagram, i.e., an equality between two morphisms such as dist/mag1 = mult/mag :
Magma → Ring. This provides Mmt users and systems with a module level invariant
for the efficient structuring of large theory graphs.

Besides imports, which induce theory morphisms into the containing theory, there
is a second kind of edge in the theory graph: Views are explicit theory morphisms that
represent translations between two theories. For example, the node on the right side of
the graph represents a theory for the integers, say declaring the constants 0, +, −, 1,
and ·. The fact that the integers are a commutative group is represented by the view
v1: If we assume that Monoid declares a constant e for the unit element and Group a
constant inv for the inverse element, then v1 carries the instantiations
grp/mon/mag/* ↦ +, grp/mon/e ↦ 0, and grp/inv ↦ −. Furthermore, every axiom declared
or imported in CGroup is mapped to a proof of the corresponding property of the
integers.

The view v2 extends v1 with corresponding instantiations for multiplication. Mmt
permits modular views as well: When defining v2, we can import all instantiations of
v1 using add ↦ v1. As above, the semantics of such an instantiation is a commutative
diagram, namely v2 ∘ add = v1 as intended.
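
The commutativity condition v2 ∘ add = v1 can be made concrete with a small Python sketch. It assumes, purely for illustration, that a view is represented as a dictionary from qualified symbol names to target symbols; the helper names include and restrict are hypothetical, and the symbol maps are abridged (unit elements and the proofs assigned to axioms are left out).

```python
# Hypothetical, abridged symbol map of the view v1 : CGroup -> Integers.
v1 = {"grp/mon/mag/*": "+", "grp/inv": "-"}


def include(import_name, view):
    """Instantiate an import with a view: prefix every assignment of the
    view with the name of the import."""
    return {f"{import_name}/{sym}": target for sym, target in view.items()}


def restrict(view, import_name):
    """Compose a view with an import, i.e., compute `view o import_name`."""
    prefix = import_name + "/"
    return {sym[len(prefix):]: target
            for sym, target in view.items() if sym.startswith(prefix)}


# v2 : Ring -> Integers reuses v1 for the additive group (add -> v1)
# and only adds the instantiation for multiplication.
v2 = {**include("add", v1), "mult/mag/*": "·"}

commutes = restrict(v2, "add") == v1
```

In this representation the diagram commutes by construction, which is the point of modular views: the expensive part of v1 is written once and reused in v2.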

The major advantage of modular design is that every declaration (abbreviations,
theorems, notations, etc.) effects induced declarations in the importing theories. A
disadvantage is that declarations may not always be located easily, e.g., the addition
in a ring is declared in a theory that is four imports away. MMT finds a compromise
here: Through qualified names, all induced declarations are addressable and locatable.
The process of removing the modularity by adding all these induced declarations to all
theories is called flattening.

[Fig. 1. Algebraic Hierarchy: Magma → Monoid → Group → CGroup; Distrib, with the
imports mag1 and mag2; Ring; Integers]

Case Study 1: The formalization in [9] uses the Twelf module system ([31]), which 
is a special case of Mmt. Twelf automatically computes the flattened theory graph. The 
modular theory graph including all axioms and proofs can be written in 180 lines of 
Twelf code. The flattened graph is computed in less than half a second and requires 
more than 1800 lines. 

The same case study defines two theories for lattices, one based on orderings and 
one based on algebra, and gives mutually inverse views to prove the equivalence of the 
two theories. Both definitions are modular: Algebraic lattices arise by importing twice 
from the theory of semi-lattices; order-based lattices arise by importing the infimum 
operation twice, once for the ordering and once for its dual. Consequently, the views 
can be given modularly as well, which is particularly important because views must 
map axioms to expensive-to-find proofs. These additional declarations take 310 lines of 
Twelf in modular and 3500 lines in flattened form.

These numbers show the value of modularity in representation already in very small
formalizations. If this is lacking, later steps will face severe scalability problems
from blow-up in representation. Here, the named imports of Mmt were the crucial
innovation to strengthen modularity.

2.2 Foundation-Independence 

Mathematical knowledge is described using very different foundations, and the most 
common foundations can be grouped into set theory and type theory. Within each group 
there are numerous variants, e.g., Zermelo-Fraenkel or Gödel-Bernays set theory, or
set theories with or without the axiom of choice. Therefore, a single representation 
language can only be adequate if it is foundation-independent. 

OpenMath and OMDoc achieve this by concentrating on structural issues and 
leaving lexical ones to an external definition mechanism like content dictionaries or 
theories. In particular, this allows us to operate without choosing a particular founda- 
tional logical system, as we can just supply content dictionaries for the symbols in the 
particular logic. Thus, logics and in the same way foundations become theories, and we 
speak of the logics-as-theories approach. 

But conceptually, it is helpful to distinguish levels here. To state a property in the
theory CGroup like commutativity of the operation ∘ := grp/mon/mag/* as
∀a, b. a ∘ b = b ∘ a, we use symbols ∀ and = from first-order logic together with ∘
from CGroup. Even though it is structurally possible to build algebraic theories by
simply importing first-order logic, this would fail to describe the meta-relationship
between the theories. But this relation is crucial, e.g., when interpreting CGroup in
the integers, the symbols of the meta-language are not interpreted because a fixed
interpretation is given in the context.

To understand this example better, we use the M/T notation for meta-languages.
M/T refers to working in the object language T, which is defined within the
meta-language M. For example, most of mathematics is carried out in FOL/ZF, i.e.,
first-order logic is the meta-language, in which set theory is defined. FOL itself
might be defined in a logical framework such as LF, and within ZF, we can define the
language of natural numbers, which yields LF/FOL/ZF/Nat. For algebra, we obtain, e.g.,
FOL/Magma. Mmt makes this meta-relation explicit: Every theory T may point to
another theory M as its meta-theory. We can write this as Mmt/(M/T).
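
As a minimal sketch of this meta-theory pointer, assume a plain dictionary that maps each theory to its meta-theory; the dictionary contents mirror the examples in the text, while the names META and meta_path are hypothetical and not taken from any Mmt implementation.

```python
# Hypothetical table: theory -> its meta-theory (LF has none).
META = {"FOL": "LF", "ZF": "FOL", "Nat": "ZF", "Magma": "FOL"}


def meta_path(theory, meta=META):
    """Follow the meta-theory pointers and render the chain in M/T notation."""
    chain = [theory]
    while chain[-1] in meta:
        chain.append(meta[chain[-1]])
    return "/".join(reversed(chain))


nat_path = meta_path("Nat")      # LF/FOL/ZF/Nat
magma_path = meta_path("Magma")  # LF/FOL/Magma
```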

In Fig. 2, the algebra example is extended by adding meta-theories. The theory FOL
for first-order logic is the meta-theory for all algebraic theories, and the theory LF
for the logical framework LF is the meta-theory of FOL and of the theory HOL for
higher-order logic.

[Fig. 2. Meta-Theories]

Now the crucial advantage of the logics-as-theories approach is that on all three
levels the same module system can be used: For example, the views m and m' indicate
possible translations on the levels of logical frameworks and logics, respectively.
Similarly, logics and foundations can be built modularly. Thus, we can use imports to
represent inheritance at the level of logical foundations and views to represent
formal translations between them. Just like in the little theories approach, we can
prove meta-logical results in the simplest foundation that is expressive enough and
then use views to move results between foundations.

Example 2 (Little Logics and Little Foundations). In [15], we formalize the syntax, 
proof theory, and model theory and prove the soundness of first-order logic in Mmt. 
Using the module system, we can treat all connectives and quantifiers separately. Thus, 
we can reuse these fragments to define other logics, and in [18] we do that, e.g., for 
sorted first-order logic and modal logic. 

For the definition of the model theory, we need to formalize set theory in Mmt,
which is a significant investment, and even then doing proofs in set theory (as needed
for the soundness proof) is tedious. Therefore, in [16], we develop the set theoretical
foundation itself modularly. We define a typed higher-order logic HOL first, which is
expressive enough for many applications such as the above soundness proof. Then a
view from HOL to ZF proves that ZF is a refinement of HOL and completes the proof of
the soundness of FOL relative to models defined in ZF.

Case Study 2: Ex. 2 already showed that it is feasible to represent foundations
and relations between foundations in Mmt. Being able to do this is a qualitative aspect
of cross-domain scalability. In another case study, we represented LF/Isabelle and
LF/Isabelle/HOL ([26,23]) as well as a translation from them into LF/FOL/ZFC
(see [18]).

To our knowledge, Mmt is the only declarative formalism in which comparable 
foundation or logic translations have been conducted. In Hets ([21]) a number of logic 
translations are implemented in Haskell. Twelf and Delphin provide logic and func- 
tional programming languages, respectively, on top of LF ([27,28]), which have been 
used to formalize the HOL-Nuprl translation ([22]). 

2.3 Symbol Identifiers "in the Large"

In mathematical languages, we need to be able to refer to (i.e., identify) content objects 
in order to state the semantic relations. It was a somewhat surprising realization in the 
design of Mmt that to understand the symbol identifiers is almost as difficult as to 
understand the whole module system. Theories are containers for symbol declarations, 
and relations between theories define the available symbols in any given theory. Since 
every available symbol should have a canonical identifier, the syntax of identifiers is 
inherently connected to the possible relations between theories. 

In principle, there are two ways to identify content objects: by location (relative to
a particular document or file) and by context (relative to a mathematical theory). The 
first one essentially makes use of the organizational structure of files and file systems, 
and the second makes use of mathematical structuring principles supplied by the repre- 
sentation format. 

As a general rule, it is preferable to use identification by context as the distribution
of knowledge over file systems is usually a secondary consideration. Then the mapping
between theory identifiers and physical theory locations can be deferred to an
extralinguistic catalog. Resource identification by context should still be compatible
with the URI-based approach that mediates most resource transport over the internet.
This is common practice in scalable programming languages such as Java where package
identifiers are URIs and classes are located using the classpath.

For logical and mathematical knowledge, the OpenMath 2 standard ([6]) and the
current OMDoc version 1.2 define URIs for symbols. A symbol is identified by the
symbol name and content dictionary, which in turn is identified by the CD name and
the CD base, i.e., the URI where the CD is located. From these constituents, symbol
URIs are formed using URI fragments (the part after the # delimiter). However,
OpenMath imposes a one-CD-one-file restriction, which is too restrictive in general.
While OMDoc 1.2 permits multiple theories per file, it requires file-unique identifiers
for all symbols. In both cases, the use of URI fragments, which are resolved only on
the client, forces clients to retrieve the complete file even if only a single symbol
is needed.

Furthermore, many module systems have features that impede or complicate the for- 
mation of canonical symbol URIs. Such features include unnamed imports, unnamed 
axioms, overloading, opening of modules, or shadowing of symbol names. Typically, 
this leads to a non-trivial correspondence between user-visible and application-internal 
identifiers. But this impedes or complicates cross-application scalability where all ap- 
plications (ranging from, e.g., a Javascript GUI to a database backend) must understand 
the same identifiers. 

Mmt avoids the above pitfalls and introduces a simple yet expressive web-scalable
syntax for symbol identifiers. An Mmt-URI is of the form doc?mod?sym where

- doc is a URI without query or fragment, e.g., http://cds.omdoc.org/math/algebra/algebra1.omdoc,
which identifies (but not necessarily locates) an Mmt document,

- mod is a /-separated sequence of local names that gives the path to a nested theory
in the above document, e.g., Ring,

- sym is a /-separated sequence imp1/.../impn/con of local names such that
each impi is an import and con a symbol name, e.g., mult/mag/*,

- a local name is of the form pchar+ where pchar is defined as in RFC 3986 [2],
which, possibly via %-encoding, permits almost all Unicode characters.

In our running example, the canonical URI of multiplication in a ring is http://cds.omdoc.org/math/alge…
Note that the use of two ? characters in a URI is unusual outside of Mmt, but legal
w.r.t. RFC 3986. Of course, Mmt also defines relative URIs that are resolved
against the URI of the containing declaration. The most important case is when doc
is empty. Then the resolution proceeds as in RFC 3986, e.g., ?mod'?sym' resolves to
doc?mod'?sym' relative to doc?mod?sym. (Note that this differs from RFC 2396.)
Mmt defines some additional cases that are needed in mathematical practice and go
beyond the expressivity of relative URIs: Relative to doc?mod?sym, the resolution of
??sym' and ?/mod'?sym' yields doc?mod?sym' and doc?mod/mod'?sym', respectively.
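
The following Python sketch shows how such identifiers can be split and how the three relative cases with an empty doc component can be resolved. It is an illustration, not the reference implementation: the names MmtUri, parse, and resolve as well as the example.org document URI are assumptions, %-encoding is ignored, and references with a non-empty but relative doc component are not handled.

```python
from typing import NamedTuple


class MmtUri(NamedTuple):
    doc: str  # document URI without query or fragment
    mod: str  # /-separated path to a (nested) theory
    sym: str  # /-separated path imp1/.../impn/con

    def __str__(self):
        return f"{self.doc}?{self.mod}?{self.sym}"


def parse(uri: str) -> MmtUri:
    """Split doc?mod?sym at the first two ? characters; this needs no
    knowledge of the surrounding document."""
    parts = uri.split("?", 2)
    parts += [""] * (3 - len(parts))
    return MmtUri(*parts)


def resolve(ref: str, base: MmtUri) -> MmtUri:
    """Resolve a reference against the URI of the containing declaration."""
    r = parse(ref)
    if r.doc:                     # doc is given: treated as absolute here
        return r
    if r.mod == "":               # ??sym' keeps doc and mod of the base
        return MmtUri(base.doc, base.mod, r.sym)
    if r.mod.startswith("/"):     # ?/mod'?sym' addresses a theory nested in mod
        return MmtUri(base.doc, base.mod + r.mod, r.sym)
    return MmtUri(base.doc, r.mod, r.sym)  # ?mod'?sym' keeps only doc


base = parse("http://example.org/algebra.omdoc?Ring?mult/mag/*")
same_doc = resolve("?Monoid?e", base)
same_mod = resolve("??add/grp/inv", base)
nested = resolve("?/Sub?x", base)
```

Since the two separators ? and / play distinct roles, parse needs only string splitting, which is what makes re-implementation in many host languages cheap.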

Case Study 3: URIs are the main data structure needed for cross-application
scalability, and our experience shows that they must be implemented by almost every
peripheral system, even those that do not implement Mmt itself. Already at this point,
we had to implement them in SML ([31]), Javascript ([13]), XQuery ([35]), Haskell (for
Hets, [21]), and Bean Shell (for a jEdit plugin), in addition to the Scala-based
reference API (Sect. 3).

This was only possible because Mmt-URIs constitute a well-balanced trade-off
between mathematical rigor, feasibility, and URI-compatibility: In particular, due to the
use of the two separators / and ? (rather than only one), they can be parsed locally, i.e.,
without access to or understanding of the surrounding Mmt document. And they can
be dereferenced using standard URI libraries and URI-URL translations. At the same
time, they provide canonical names for all symbols that are in scope, including those
that are only available through imports.



3 A Scalable Implementation 

As the implementation language for the Mmt reference API, we pick Scala, a
programming language designed to be scalable ([24]). Being functional, Scala permits
elegant code, and being based on and fully compatible with Java, it offers
cross-application and web-level scalability.

The Mmt API implements the syntax and semantics of Mmt. It compiles to a 
1 MB Java archive file that can be readily integrated into applications. Library and 
documentation can be obtained from [30]. Two technical aspects are especially notable 
from the point of view of scalability. Firstly, all Mmt functionality is exposed to non- 
Java applications via a scriptable shell and via an HTTP servlet. Secondly, the API 
maintains an abstraction layer that separates the backends that physically store Mmt 
documents (URLs) from the document identifiers (URIs). Thus, it is configurable which 
Mmt documents are located, e.g., in a remote database or on the local file system. In 
the following section we describe some of the advanced features. 

3.1 Validation 

Validation describes the process of checking Mmt theory graphs against the Mmt 
grammar and type system. Mmt validation is done in three increasingly strict stages. 

The first stage is XML validation against a context-free RelaxNG grammar. As this
only catches errors in the surface syntax, Mmt documents are validated structurally 
in a second stage. Structural validity guarantees that all declarations have unique URIs 
and that all references to symbols, theories, etc. can be resolved. This is still too lax 
for mathematics as it lacks type-checking. But it is exactly the right middle ground 
between the weak validation against a context-free grammar and the expensive and 
complex validation against a specific type system: On the one hand, it is efficient and 
foundation-independent, and on the other hand, it provides an invariant that is sufficient 
for many MKM services such as storage, navigation, or search. 

Type-checking is foundation-specific, therefore each foundation must provide an
Mmt plugin that implements the typing and equality judgments. More precisely, the
plugin must provide functions that (semi-)decide, for two given terms A and B over a
theory T, the judgments ⊢_T A = B and ⊢_T A : B. Given such a plugin, a third
validation stage can refine structural validity by additionally validating
well-typedness of all declarations. For scalability, it is important (i) that these
plugins are stateless as the theory graph is maintained by Mmt, and (ii) that the
modular structure is transparent to the plugin. Thus plugin developers only need to
provide the core algorithms for the specific type system, and all MKM issues can be
relegated to dedicated implementations.

Context-free validation is well-understood. Moreover, Mmt is designed such that 
foundation-specific validation is obtained from structural validation by using the same 
inference system with some typing and equality constraints added. This leaves structural 
validation as the central issue for scalability. 

Case Study 4: We have implemented structural validation by decomposing an Mmt
theory graph into a list of atomic declarations. For example, the declaration
T = {s1 : T1, s2 : T2} of a theory T with two typed symbols yields the atomic
declarations T = {}, T?s1 : T1, and T?s2 : T2. This "unnesting" of declarations is a
special property of the Mmt type system that is not usually found in other systems. It
is possible because every declaration has a canonical URI and can therefore be taken
out of its context.

This is important for scalability as it permits incremental processing. In particu- 
lar, large Mmt documents can be processed as streams of atomic declarations. Further- 
more, the semantics of Mmt guarantees that the processing order of these streams never 
matters if the (easily-inferrable) dependencies between declarations are respected. This 
would even permit parallel processing, another prerequisite for scalability. 
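
A rough Python sketch of unnesting and order-independent stream processing follows. It assumes that an atomic declaration is represented as a pair of a URI and the set of URIs it depends on; the function names unnest and validate_stream and the abbreviated document URI d are hypothetical, and the check covers only the two structural conditions named above (unique URIs, resolvable references).

```python
def unnest(theory_uri, symbols):
    """Turn T = {s1 : T1, ...} into atomic declarations (uri, dependencies)."""
    yield theory_uri, set()
    for name, deps in symbols:
        # Every symbol depends on its theory and on the URIs its type mentions.
        yield f"{theory_uri}?{name}", {theory_uri} | set(deps)


def validate_stream(decls):
    """Structurally validate a stream of atomic declarations.

    Declarations may arrive in any order; one is accepted as soon as all of
    its dependencies have been accepted. Returns the acceptance order.
    """
    accepted, order, waiting = set(), [], []
    for uri, deps in decls:
        if uri in accepted or any(uri == w[0] for w in waiting):
            raise ValueError(f"duplicate URI: {uri}")
        waiting.append((uri, set(deps)))
        progress = True
        while progress:  # release everything that has become processable
            progress = False
            for item in list(waiting):
                if item[1] <= accepted:
                    waiting.remove(item)
                    accepted.add(item[0])
                    order.append(item[0])
                    progress = True
    if waiting:
        raise ValueError(f"unresolved references in: {[w[0] for w in waiting]}")
    return order


atoms = list(unnest("d?T", [("s1", []), ("s2", ["d?T?s1"])]))
order = validate_stream(reversed(atoms))  # feed the stream in the "wrong" order
```

Even when the declarations arrive in reverse, the acceptance order respects the dependencies, which is the property that makes streaming and, in principle, parallel processing possible.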

3.2 Querying 

Once a theory graph has been read, Mmt provides two ways to access it: Mmt-URI
dereferencing and querying with respect to a simple ontology.

Firstly, a theory graph always has two forms: the modular form where all nodes are 
partial theories whose declarations are computed using imports, and the flattened form 
where all imports are replaced with translated copies of the imported declarations. Many 
implementations of module systems, e.g., Isabelle's locales, automatically compute the 
flat form and do not maintain the modular form. This can be a threat to scalability as it 
can induce combinatorial explosion. 

Mmt maintains only the modular form. However, as every declaration present in 
the flat form has a canonical URI, the access to the flat form is possible via Mmt-URI 
dereferencing: Dereferencing computes (and caches) the induced declarations present 
in the flat form. Thus, applications can ignore the modular structure and interact with a 
modular theory graph as if it were flattened, but the exponentially expensive flattening 
is performed transparently and incrementally. 

Secondly, the API computes the ABox of a theory graph with respect to the Mmt
ontology. It has Mmt-URIs as individuals and 10 types like theory or symbol as
unary predicates. 11 binary predicates represent relations between individuals such as
HasDomain relating an import to a theory or HasOccurrenceOfInType relating
two symbols. These relations are structurally complete: The structure of a theory graph
can be recovered from the ABox. The computation time is negligible as it is a byproduct
of validation anyway.

The API includes a simple query language for this ontology. It can compute all 
individuals in the theory graph that are in a certain relation to a given individual. The 
possible queries include composition, union, transitive closure, and inverse of relations. 
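
The four relational operations can be sketched over binary relations represented as sets of pairs. The sketch makes no claim about the concrete syntax of the query language in the API; the relation imports_from and its contents are a hypothetical stand-in for part of an ABox of the algebra example.

```python
def compose(r, s):
    """All (a, d) such that (a, b) is in r and (b, d) is in s."""
    return {(a, d) for (a, b) in r for (c, d) in s if b == c}


def union(r, s):
    return r | s


def inverse(r):
    return {(b, a) for (a, b) in r}


def transitive_closure(r):
    closure = set(r)
    while True:
        new = compose(closure, r) - closure
        if not new:
            return closure
        closure |= new


def related(r, individual):
    """All individuals that are in relation r to the given individual."""
    return {b for (a, b) in r if a == individual}


# Hypothetical binary predicate: a theory directly imports from another theory.
imports_from = {("Monoid", "Magma"), ("Group", "Monoid"), ("CGroup", "Group"),
                ("Ring", "CGroup"), ("Ring", "Monoid")}

depends_on = transitive_closure(imports_from)
ring_needs = related(depends_on, "Ring")               # everything Ring depends on
magma_users = related(inverse(depends_on), "Magma")    # everything depending on Magma
```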



The ABox can also be regarded as the result of compiling an Mmt theory graph.
Many operations on theory graphs only require the ABox: for example the computation
of the forward or backward dependency cone of a declaration, which are needed to
generate self-contained documents and in change management, respectively. This is
important for cross-application scalability because applications can parse the ABox
very easily. Moreover, we obtain a notion of separate compilation: ABox-generation only
requires structural validity, and the latter can be implemented if only the ABoxes of the
referenced files are known.

Case Study 5: Since all Mmt knowledge items have a globally unique Mmt-URI, 
being able to dereference them is sufficient to obtain complete information about a 
theory graph. We have implemented a web servlet for Mmt that can act as a local 
proxy for Mmt-URIs and as a URI catalog that maps Mmt-URIs into (local or remote) 
URLs. The former means that all Mmt-URIs are resolved locally if possible. The latter 
means that the Mmt-URI of a module can be independent from its physical location. 
The same servlet can be run remotely, e.g., on the same machine as a mathematical 
database and configured to retrieve files directly from there or from other servers. 

Thus systems can access all their input documents by URI via a local service, which 
makes all storage issues transparent. (Using presentation, see below, these can even 
be presented in the system's native syntax.) This solves a central problem in current 
implementations of formal systems: the restriction to in-memory knowledge. Besides 
the advantages of distributed storage and caching, a simple example application is that 
imported theories are automatically included when remote documents are retrieved in 
order to avoid successive lookups. 



3.3 Presentation 

Mmt comes with a declarative language for notations similar to [19] that can be used 
to transform Mmt theory graphs into arbitrary output formats. Notations are declared 
by giving parameters such as fixity and input/output precedence, and snippets for sep- 
arators and brackets. Notations are not only used for mathematical objects but also for 
all Mmt expressions, e.g. theory declarations and theory graphs. 

Two aspects are particularly important for scalability. Firstly, sets of notations (called 
styles) behave like theories, which are sets of symbols. In particular, styles and notations 
have Mmt-URIs (and are part of the Mmt ontology), and the Mmt module system can 
be used for inheritance between styles. 

Secondly, every Mmt expression has a URI E: for declarations this is trivial, and for 
most mathematical objects it is the URI of the head symbol. Correspondingly, every 
notation must give an Mmt-URI N, and the notation is applicable to an expression if 
N is a prefix of E. More specific notations can inherit from more general ones, e.g., the 
brackets and separators are usually given only once in the most general notation. This 
simplifies the authoring and maintenance of styles for large theory graphs significantly. 
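
The prefix rule can be sketched as follows. This is an illustration under assumptions, not the Mmt implementation: the table NOTATIONS, the URIs in it, and the modeling of inheritance as a field-by-field override (more specific notations overriding more general ones) are all hypothetical.

```python
from typing import Dict, List, Optional

# Hypothetical notation table: Mmt-URI prefix N -> rendering parameters.
NOTATIONS: Dict[str, Dict[str, str]] = {
    # Most general notation: brackets and separators are given once.
    "http://cds.omdoc.org/": {"open": "(", "close": ")", "separator": ","},
    # Specific notation for one symbol: only fixity and operator are given.
    "http://cds.omdoc.org/math/algebra?Magma?*": {"fixity": "infix", "operator": "*"},
}


def applicable(expr_uri: str) -> List[str]:
    """All prefixes N that are a prefix of E, most general first."""
    return sorted((n for n in NOTATIONS if expr_uri.startswith(n)), key=len)


def effective_notation(expr_uri: str) -> Optional[Dict[str, str]]:
    """Merge all applicable notations; more specific ones override."""
    prefixes = applicable(expr_uri)
    if not prefixes:
        return None
    merged: Dict[str, str] = {}
    for n in prefixes:
        merged.update(NOTATIONS[n])
    return merged
```

In this sketch, the notation for ?Magma?* obtains its brackets and separator from the general notation and adds only the infix fixity, while any other symbol under the same base URI falls back to the general notation alone.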

Case Study 6: In order to present Mmt content as, e.g., HTML with embedded 
presentation MathML, we need a style with only the 20 generic notations given in 
http://alpha.tntbase.mathweb.org/repos/cds/omdoc/mathml.omdoc. 

For example, the following notation declaration applies to all constants whose 
cdbase starts with http://cds.omdoc.org/ and renders OMS elements as mo elements. 

    <notation for="http://cds.omdoc.org/" role="constant">
      <element name="mo">
        <attribute name="xref">
          <text value="#"/><id/>
        </attribute>
        <hole><component index="2"/></hole>
      </element>
    </notation>

The latter has an xref attribute that links to the parallel markup (which is included by 
notations at higher levels). The content of the mo elements is a "hole" that is by 
default filled with the second component, for constants that is the name (0 and 1 are 
cdbase and cd). 

This scales well because we can give notations for specific theories, e.g., by saying 
that ?Magma?* is associative infix and optionally giving a different operator symbol 
than *. We can also add other output formats easily: Our implementation (see [18]) 
extends the above notation with a jobad:href attribute containing the Mmt-URI; 
this URI is picked up by our JOBAD JavaScript ([13]) for hyperlinking. 



4 A Scalable Database 

The TNTBase system [34] is a versioned XML-database with a client-server architecture. 
It integrates Berkeley DB XML into a Subversion server. DB XML stores HEAD 
revisions of XML files; non-XML content like PDF, images or LaTeX source files, 
differences between revisions, directory entry lists and other repository information are 
retained in a usual SVN back-end storage (Berkeley DB in our case). Keeping XML 
documents in DB XML allows accessing files not only via any SVN client but also 
through the DB XML API that supports efficient querying of XML content via XQuery 
and (versioned) modification of that content via XQuery Update. 

In addition, TNTBase provides a plugin architecture for document format-specific 
customizations [35]. Using OMDoc as concrete syntax for Mmt and the Mmt API as 
a TNTBase plugin, we have made TNTBase MMT-aware so that data-intensive Mmt 
algorithms can be executed within the database. 

The TNTBase system and its documentation are available at http://tntbase.org. 
Below we describe some of the features particularly relevant for scalability. 

4.1 Generating Content 

Large scale collaborative authoring of mathematical documents requires distributed 
and versioned storage. On the language end, Mmt supports this by making all iden- 
tifiers URIs so that Mmt documents can be distributed among authors and networks 
and reference each other. On the database end, TNTBase supports this by acting as a 
versioned Mmt database. 

In principle, versioning and distribution could also be realized with a plain SVN 
server. But for mathematics, it is important that the storage backend is aware of at 
least some aspects of the mathematical semantics. In large scale authoring processes, 
an important requirement is to guarantee consistency, i.e., it should be possible to reject 
commits of invalid documents. Therefore, TNTBase supports document format-specific 
validation. 



For scalability, it is crucial that validation of interlinked collections of Mmt docu- 
ments is incremental, i.e., only those documents added or changed during a commit are 
validated. This makes a significant difference because the committed documents almost always 
import modules from other documents that are already in the database, and these should 
not be revalidated, especially not if they contain unnecessary modules that introduce 
further dependencies. 

Therefore, we integrate Mmt separate compilation into TNTBase. During a com- 
mit TNTBase validates all committed files structurally by calling the Mmt API. After 
successful validation, the ABox is generated and immediately stored in TNTBase. Ref- 
erences to previously committed files are not resolved; instead their generated ABox is 
used for validation. Thus, validation is limited to the committed documents. 
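
The commit workflow can be illustrated by the following sketch. It is a strong simplification under stated assumptions: the class Repository and its fields are hypothetical, an ABox is reduced to the set of module names a document declares, and "structural validation" is reduced to checking that every imported module is declared somewhere. The real check is performed by the Mmt API.

```python
from typing import Dict, List, Set, Tuple


class Repository:
    """Hypothetical store that validates only the documents in a commit."""

    def __init__(self) -> None:
        # Cached ABoxes, simplified to: document path -> declared modules.
        self.abox_cache: Dict[str, Set[str]] = {}
        # Log of documents that were actually validated.
        self.validated: List[str] = []

    def commit(self, docs: Dict[str, Tuple[Set[str], Set[str]]]) -> None:
        """docs maps a path to (declared modules, imported modules)."""
        # Modules known from cached ABoxes of documents not in this commit.
        known: Set[str] = set()
        for path, declared in self.abox_cache.items():
            if path not in docs:
                known |= declared
        # Modules declared by the committed documents themselves.
        for declared, _ in docs.values():
            known |= declared
        # Validate only the committed documents; reject the whole commit
        # if any import cannot be resolved.
        for path, (_, imported) in docs.items():
            missing = imported - known
            if missing:
                raise ValueError(f"{path}: unresolved imports {sorted(missing)}")
        # Validation succeeded: generate and store the ABoxes.
        for path, (declared, _) in docs.items():
            self.abox_cache[path] = set(declared)
            self.validated.append(path)
```

Committing a document that imports a previously committed module consults only the cached ABox of the earlier document; the earlier document itself is never opened or revalidated.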

Case Study 7: In the Latin project [18], we create an atlas of logics and logic 
translations formalized in Mmt. At the current early stage of the project, 5 people are 
actively editing about 100 files so far. These contain about 200 theories and 50 views, 
which form a single highly inter-linked Mmt theory graph. We use TNTBase as the 
validity-guaranteeing backend storage. 

The Latin theory graph is highly modular. For example, the document giving the 
set-theoretical model theory of first-order logic from [16] depends on about 100 other 
theories. (We counted them conveniently using an XQuery, see below.) Standalone vali- 
dation of this document takes about 15 seconds if needed files are retrieved from a local 
file system. Using separate compilation in TNTBase, it is almost instantaneous. 

In fact, we can configure TNTBase so that structural validation is preceded by Re- 
laxNG validation. This permits the Mmt application to drop inefficient checks for syn- 
tax errors. Similarly, structural validation could be preceded by foundation-specific val- 
idation, but often we do not have a well-understood notion of separate compilation for 
specific foundations. But even in that case, we can do better than naive revalidation. 
Mmt is designed so that it is foundation-independent which modules a given document 
depends on. Thus, we can collect these modules in one document using an efficient 
XQuery (see below) and then revalidate only this document. Moreover, we can use 
the presentation algorithm from Sect. 3.3 to transform the generated document into the 
input syntax of a dedicated implementation. 



4.2 Retrieving Content 

While the previous section already showed some of the strength of an MMT-aware 
TNTBase, its true strength lies in retrieving content. Like every XML-native database, 
TNTBase supports XQuery but extends the DB XML syntax by a notion of file system 
path to address path-based collections of documents. Furthermore, it supports index- 
ing to improve performance of queries and the querying of previous revisions. Finally, 
custom XQuery modules can be integrated into TNTBase. 

MMT-aware retrieval is achieved through two measures. Firstly, ABox caching 
means that for every committed file, the Mmt ABox is generated and stored in TNT- 
Base. The ABox contains only two kinds of declarations — instances of unary and 
binary predicates — and is stored as a simple XML document. The element types in 
these documents are indexed, which yields efficient global queries. 



Example 3. An Mmt document for the algebra example from Sect. 2.1 is served at 
http://alpha.tntbase.mathweb.org/repos/cds/math/algebra/algebra1.omdoc. Its ABox is cached at 
http://alpha.tntbase.mathweb.org:8080/tntbase/cds/restful/integration/validation/mmt/content/math/algebra/algebra1.omdoc. 

Secondly, custom XQuery functions utilize the cached and indexed ABoxes to 
provide efficient access to frequently needed aggregated documents. These include in 
particular the forward and backward dependency cones of a module. The backward 
dependency cone of a module M is the minimal set of modules needed to make M 
well-formed. Dually, the forward cone contains all modules that need M. If it were not 
for the indexed ABoxes, the latter would be highly expensive to compute: linear in the 
size of the database. 
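
The two cones can be computed from indexed import facts as in the following sketch. It is not the XQuery code used in TNTBase: the functions build_indexes and cone are hypothetical, the ABox is reduced to a list of binary "imports" facts, and we assume that a cone does not contain the module M itself.

```python
from collections import defaultdict, deque
from typing import Dict, Iterable, Set, Tuple

Index = Dict[str, Set[str]]


def build_indexes(imports: Iterable[Tuple[str, str]]) -> Tuple[Index, Index]:
    """Index facts (M, N), read as 'M imports N', in both directions."""
    backward: Index = defaultdict(set)  # M -> modules that M imports
    forward: Index = defaultdict(set)   # N -> modules that import N
    for importer, imported in imports:
        backward[importer].add(imported)
        forward[imported].add(importer)
    return backward, forward


def cone(index: Index, start: str) -> Set[str]:
    """Transitive closure of one index from start, excluding start itself."""
    seen: Set[str] = set()
    queue = deque([start])
    while queue:
        module = queue.popleft()
        for neighbour in index.get(module, ()):
            if neighbour != start and neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen
```

With the forward index in place, computing the forward cone visits only the modules that actually depend on M instead of scanning every document in the database.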

Case Study 8: The Mmt presentation algorithm described in Sect. 3.3 takes a set 
of notations as input. However, additional notations may be given in imported theories, 
typically format-independent notations such as the one making ?Magma?* infix. There- 
fore, when an Mmt expression is rendered, all imported theories must be traversed for 
the sole reason of obtaining all notations. 

Without Mmt awareness in TNTBase, this would require multiple successive queries 
which is particularly harmful when presentation is executed locally while the imported 
theories are stored remotely. But even when all theories are available on a local disk, 
these successive calls already take 1.5 seconds for the above algebra document. (Once 
the notations are retrieved, the presentation itself is instantaneous.) 

In MMT-aware TNTBase, we can retrieve all notations in the backward dependency 
closure of the presented expression with a single XQuery. ABox-indexing made this 
instantaneous up to network lag. 

TNTBase not only permits the efficient retrieval of such generated documents, 
but also permits committing edited versions of them. We call these virtual documents 
in [35]. These are essentially "XML database views" analogous to views in relational 
databases. They are editable, and TNTBase transparently patches the differences into 
the original files in the underlying versioning system. 

Case Study 9: While manual refactoring of large theory graphs is as difficult as 
refactoring large software, there is virtually no tool support for it. For Mmt, we obtain 
a simple renaming tool using a virtual document for the one-step (i.e., non-transitive) 
forward dependency cone of a theory T (see [35] for an example). That contains all 
references to T so that T can be renamed and all references modified in one go. 
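
The idea behind such a renaming tool can be sketched as follows. This is an illustration only: the function rename_module is hypothetical, and a theory graph is simplified to a dictionary from module names to the modules they reference directly, whereas TNTBase operates on a virtual XML document and patches the underlying files.

```python
from typing import Dict, List


def rename_module(graph: Dict[str, List[str]], old: str, new: str) -> List[str]:
    """Rename old to new and update all direct references; return the
    modules in the one-step forward cone that were modified."""
    # One-step (non-transitive) forward cone: modules referencing old directly.
    touched = [module for module, refs in graph.items() if old in refs]
    for module in touched:
        graph[module] = [new if ref == old else ref for ref in graph[module]]
    # Finally rename the module itself.
    if old in graph:
        graph[new] = graph.pop(old)
    return touched
```

Modules that depend on T only transitively are left untouched, which is why the one-step cone suffices for renaming.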

5 Conclusion and Future Work 

This paper aims to pave the way for MKM "in the large" by proposing a theoretical and 
technological basis for a "Universal Digital Mathematics Library" (UDML) which has 
been touted as the grand challenge for MKM. In a nutshell, we conclude that the 
problem of scalability has to be addressed on all levels: we need modularity and accessibility 
of induced declarations in the representation format, incrementality and memoization 
in the implementation of the fundamental algorithms, and a mass storage solution that 
supports fragment access and indexing. We have developed prototypical 
implementations and tested them on a variety of case studies. 

The next step will be to integrate the parts and assemble a UDML installation with 
these. We plan to use the next generation of the OMDoc format, which will integrate 
the MMT infrastructure described in this paper as an interoperability layer; see [20] for 
a discussion of the issues involved. In recent years, we have developed OMDoc 
translation facilities for various fully formal theorem proving systems and their libraries. 
In the LATIN project [18], we are already developing a graph of concrete "logics-as- 
theories" to make the underlying logics interoperable. 